1 Introduction
The value-based methods presented previously all presuppose a known world model (i.e., a known transition probability distribution). In most RL applications, however, and especially in the real world, the world model is not known, so we instead estimate value functions from data sampled as the agent interacts with the environment.
There are two well-tested methods used in this regard: Monte Carlo control and Temporal Difference learning.
2 Monte Carlo control
Monte Carlo control consists of two steps:
- Estimation: we roll out a policy from an initial state (or from an initial state–action pair if we’re estimating the Q-function), collect rewards, and, once the episode ends or the discount factor makes later rewards negligible, update our estimate of the value function toward the observed discounted return:
\[ V(s_t) \leftarrow V(s_t) + \eta \bigl[ G_t - V(s_t) \bigr], \]
where \( G_t \) is the discounted return collected from \( s_t \) onward; with \( \eta = 1/N(s_t) \), the count of visits to \( s_t \), this is exactly the running average of the returns.
- Improvement: once we have an estimate, we follow the policy-iteration scheme and improve the policy by acting greedily with respect to the current estimate:
\[ \pi_{k+1}(s) = \arg\max_a Q_k(s,a), \]
with \( Q_k(s,a) \) approximated by Monte Carlo estimation. A minimal sketch of this loop follows the list.
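To make the two steps concrete, here is a minimal sketch of every-visit Monte Carlo control with an ε-greedy policy. The tabular environment interface (`env.reset()` returning a state and `env.step(action)` returning `(next_state, reward, done)`) and the function names are assumptions for illustration, not a specific library; `eta` plays the role of \( \eta \) above.

```python
import random
from collections import defaultdict

def mc_control(env, n_actions, episodes=5000, gamma=0.99, eta=0.05, epsilon=0.1):
    """Every-visit Monte Carlo control with an epsilon-greedy policy (sketch)."""
    Q = defaultdict(lambda: [0.0] * n_actions)

    def policy(state):
        # Improvement step: act greedily w.r.t. the current Q, with
        # epsilon-exploration so every state-action pair keeps being visited.
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[state][a])

    for _ in range(episodes):
        # Estimation step: roll out one full episode and record the trajectory.
        trajectory = []
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state

        # Walk the episode backwards so every discounted return G_t is
        # computed in a single pass, then nudge Q toward it.
        G = 0.0
        for state, action, reward in reversed(trajectory):
            G = reward + gamma * G
            Q[state][action] += eta * (G - Q[state][action])

    return Q
```

Since the policy is defined greedily from \( Q \), every estimation pass also improves the policy, which is exactly the policy-iteration structure described above.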
3 Temporal Difference (TD) Learning
Temporal Difference learning reduces the Bellman error for sampled states (or state–action pairs) at each transition:
\[ V(s_t) \leftarrow V(s_t) + \eta \left[ r_t + \gamma V(s_{t+1}) - V(s_t) \right], \]
where \( V(s_t) \) is moved toward the target value \( r_t + \gamma V(s_{t+1}) \).
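For comparison with the Monte Carlo update, here is a minimal TD(0) prediction sketch for a fixed policy, under the same hypothetical environment interface as above; the key difference is that the target \( r_t + \gamma V(s_{t+1}) \) is available at every transition, so no complete episode is needed before updating.

```python
from collections import defaultdict

def td0_prediction(env, policy, episodes=5000, gamma=0.99, eta=0.05):
    """TD(0) value estimation for a fixed policy (sketch)."""
    V = defaultdict(float)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bootstrapped target: the observed reward plus the discounted
            # estimate of the next state's value (zero beyond termination).
            target = reward + gamma * (0.0 if done else V[next_state])
            V[state] += eta * (target - V[state])
            state = next_state
    return V
```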
4 Combining Monte Carlo and TD
Monte Carlo and TD are often combined. TD updates at each transition, whereas MC waits until an episode finishes (or the horizon is effectively reached). We can keep the bootstrapped TD update at each transition while looking further along the trajectory: instead of committing to a single horizon \( n \), we take a weighted average of the \( n \)-step returns \( G_{t:t+n} = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} V(s_{t+n}) \), with weights parameterized by \( \lambda \in [0,1] \):
\[ G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}. \]
This is the lambda-return. It replaces the one-step target in the TD update, i.e. \( V(s_t) \leftarrow V(s_t) + \eta \bigl[ G_t^{\lambda} - V(s_t) \bigr] \). Notable cases:
- \( \lambda = 1 \): the lambda-return reduces to the full Monte Carlo return \( G_t \).
- \( \lambda = 0 \): the lambda-return reduces to the one-step TD(0) target \( r_t + \gamma V(s_{t+1}) \).
Algorithms that update toward \( G_t^{\lambda} \) in this way are referred to as TD(\(\lambda\)); a short sketch of computing the lambda-return for a finished episode follows.
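As a small illustration, the sketch below computes \( G_t^{\lambda} \) for every step of one finished episode using the backward recursion \( G_t^{\lambda} = r_t + \gamma\bigl[(1-\lambda)V(s_{t+1}) + \lambda G_{t+1}^{\lambda}\bigr] \), which is equivalent to the weighted sum above for episodes that terminate; the list-based inputs and the function name are assumptions for illustration.

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.9):
    """Compute G_t^lambda for every step of one terminated episode.

    rewards[t] is r_t, and values[t] is V(s_{t+1}), the estimated value of
    the state reached after step t (0.0 for the terminal state).
    """
    G = 0.0  # lambda-return of the empty suffix after the final step
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        # Backward recursion: mix the bootstrapped value and the
        # lambda-return of the rest of the trajectory.
        G = rewards[t] + gamma * ((1.0 - lam) * values[t] + lam * G)
        out[t] = G
    return out
```

Setting `lam=1` recovers the Monte Carlo return and `lam=0` the one-step TD target, matching the two cases listed above.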
TD policy learning can be done on-policy (SARSA) or off-policy (Q-learning). We’ll expand on both and provide simple implementations in the next article.